On the size of minimal automata for approximate string matching

Author

  • Nadia El-Mabrouk
Abstract

A natural way to solve the problem of string matching with k mismatches is to construct a finite automaton recognizing the language $L_{P,k}$ of all strings at distance k from the searched pattern P, and to use it for a linear-time search of the text. The drawback of this approach is its high space complexity. In this paper, we show that even for the minimal DFA recognizing $L_{P,k}$, the memory required remains large, and the number of states C of the minimal automaton grows quickly with the size m of P. For a pattern consisting of a single repeated character, the exact number of states is $C = \binom{m+1}{k+1}$. For a pattern whose characters are all different, an accurate lower bound on C is a sum over $i = 1, \dots, k$ of the terms $b_i$, each weighted by a positive integer coefficient that depends on m and k, where $b_i$ is the $(i+1)$th Catalan number. For a random pattern P, a lower bound is obtained by considering the longest prefix of P that consists either of a single repeated character or of characters that are all different.
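
The automaton approach described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's construction: it assumes the Hamming-distance reading "at most k mismatches", builds a deterministic automaton by plain subset construction over NFA states (i, j) meaning "a prefix of length i matched with j mismatches", and does not minimize the result, so the number of states it produces is only an upper bound on the minimal size C. The function names `build_mismatch_dfa` and `search_k_mismatches` and the explicit `alphabet` parameter are hypothetical choices made for this illustration.

```python
def build_mismatch_dfa(pattern, k, alphabet):
    """Subset construction for text search with at most k mismatches.

    Assumption for this sketch: an NFA state (i, j) means that the last i
    text characters match pattern[:i] with j mismatches (j <= k).
    Returns (delta, accepting), where delta[state_id][char] -> state_id
    and state 0 is the start state.
    """
    m = len(pattern)
    start = frozenset({(0, 0)})
    ids = {start: 0}          # subset of NFA states -> DFA state id
    delta = {}
    accepting = set()
    todo = [start]
    while todo:
        subset = todo.pop()
        sid = ids[subset]
        # A DFA state accepts when some NFA state has consumed the whole pattern.
        if any(i == m for i, _ in subset):
            accepting.add(sid)
        delta[sid] = {}
        for c in alphabet:
            # (0, 0) is always active: a match may start at any text position.
            nxt = {(0, 0)}
            for i, j in subset:
                if i < m:
                    j2 = j + (pattern[i] != c)
                    if j2 <= k:
                        nxt.add((i + 1, j2))
            nxt = frozenset(nxt)
            if nxt not in ids:
                ids[nxt] = len(ids)
                todo.append(nxt)
            delta[sid][c] = ids[nxt]
    return delta, accepting


def search_k_mismatches(text, pattern, k, alphabet):
    """Return the end positions of all windows within k mismatches of pattern.

    Every character of text is assumed to belong to alphabet.
    """
    delta, accepting = build_mismatch_dfa(pattern, k, alphabet)
    state, ends = 0, []
    for pos, c in enumerate(text):
        state = delta[state][c]   # one table lookup per text character
        if state in accepting:
            ends.append(pos)
    return ends
```

For example, `search_k_mismatches("abbaab", "aba", 1, "ab")` returns `[2, 3]`: the windows "abb" and "bba" each differ from "aba" in one position. Calling `len(build_mismatch_dfa(pattern, k, alphabet)[0])` for growing patterns gives a feel for how quickly the state table grows, which is the space cost the abstract is concerned with.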


Related resources

Space Complexity of Linear Time Approximate String Matching

Approximate string matching is a sequential problem and therefore it is possible to solve it using finite automata. Nondeterministic finite automata are constructed for string matching with k mismatches and k differences. The corresponding deterministic finite automata are the basis for approximate string matching in linear time. Then the space complexity of both types of deterministic automata is calculat...

Reduced Nondeterministic Finite Automata for Approximate String Matching

We show how to reduce the number of states of nondeterministic finite automata for approximate string matching with k mismatches, and of nondeterministic finite automata for approximate string matching with k differences, in the case where we do not need to know how many mismatches or differences occur in the string found. We also show the impact of this reduction on Shift-Or based algorithms.

Efficient generation of super condensed neighborhoods

Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Condensed neighborhoods, however, are not a minimal representation of a pattern neighborhood. Super condensed neighborhoods, proposed in this work, are smaller, provably minimal and can be used to locate approximate matches that can later be extended by on-line search. We...

Faster Generation of Super Condensed Neighbourhoods Using Finite Automata

We present a new algorithm for generating super condensed neighbourhoods. Super condensed neighbourhoods have recently been presented as the minimal set of words that represent a pattern neighbourhood. These sets play an important role in the generation phase of hybrid algorithms for indexed approximate string matching. An existing algorithm for this purpose is based on a dynamic programming ap...

Approximate Regular Expression Matching

We extend the definition of Hamming and Levenshtein distance between two strings, as used in approximate string matching, so that these two distances can also be used in approximate regular expression matching. Next, methods for constructing nondeterministic finite automata for approximate regular expression matching under both distances are presented.

Publication date: 1997